QueST: Incentivizing LLMs to Generate Difficult Problems
Hu, Hanxu, Zhang, Xingxing, Vamvas, Jannis, Sennrich, Rico, Wei, Furu
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We leverage QueST to generate large-scale synthetic coding problems, which we then use to distill from strong teacher models with long chain-of-thought or to conduct reinforcement learning for smaller models, proving effective in both scenarios. Our distillation experiments demonstrate significant performance gains. Specifically, after fine-tuning Qwen3-8B-base on 100K difficult problems generated by QueST, we surpass the performance of the original Qwen3-8B on LiveCodeBench. With an additional 112K examples (i.e., 28K human-written problems paired with multiple synthetic solutions), our 8B model matches the performance of the much larger DeepSeek-R1-671B. These findings indicate that generating complex problems via QueST offers an effective and scalable approach to advancing the frontiers of competitive coding and reasoning for large language models.
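The core filtering idea behind difficulty-aware rejection can be sketched in a few lines: keep only generated problems that a reference solver rarely solves, then train the generator on those. The threshold, problem names, and solve rates below are invented for illustration; the paper's actual pipeline (graph sampling plus rejection fine-tuning) is considerably more involved.

```python
# Minimal sketch of difficulty-aware rejection filtering. All names, rates,
# and the 0.2 threshold are illustrative placeholders, not from the paper.

def reject_easy(problems, solve_rate, max_rate=0.2):
    """Keep problems whose empirical solve rate is at most max_rate."""
    return [p for p in problems if solve_rate[p] <= max_rate]

# Hypothetical per-problem solve rates measured with a reference model.
solve_rate = {"two-sum": 0.95, "segment-tree-beats": 0.10, "fft-convolution": 0.18}
hard = reject_easy(list(solve_rate), solve_rate)
print(hard)  # ['segment-tree-beats', 'fft-convolution']
```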
Efficient Prediction of Pass@k Scaling in Large Language Models
Kazdan, Joshua, Schaeffer, Rylan, Allouah, Youssef, Sullivan, Colin, Yu, Kyssen, Levi, Noam, Koyejo, Sanmi
Assessing the capabilities and risks of frontier AI systems is a critical area of research, and recent work has shown that repeated sampling from models can dramatically increase both. For instance, repeated sampling has been shown to increase their capabilities, such as solving difficult math and coding problems, but it has also been shown to increase their potential for harm, such as being jailbroken. Such results raise a crucial question for both capability and safety forecasting: how can one accurately predict a model's behavior when scaled to a massive number of attempts, given a vastly smaller sampling budget? This question is directly relevant to model providers, who serve hundreds of millions of users daily, and to governmental regulators, who seek to prevent harms. To answer this question, we make three contributions. First, we find that standard methods for fitting these laws suffer from statistical shortcomings that hinder predictive accuracy, especially in data-limited scenarios. Second, we remedy these shortcomings by introducing a robust estimation framework, which uses a beta-binomial distribution to generate more accurate predictions from limited data. Third, we propose a dynamic sampling strategy that allocates a greater budget to harder problems. Combined, these innovations enable more reliable prediction of rare risks and capabilities at a fraction of the computational cost.
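The beta-binomial idea has a convenient closed form worth noting: if each problem's per-attempt success rate p is modeled as Beta(α, β), then pass@k = 1 − E[(1−p)^k] = 1 − B(α, β+k)/B(α, β), which lets a fit from a small sampling budget be extrapolated to large k. The sketch below shows only this formula; the α, β values are illustrative, not fits from the paper.

```python
from math import lgamma, exp

def log_beta(a, b):
    # log B(a, b) computed via log-gamma for numerical stability
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def pass_at_k(alpha, beta, k):
    # If a problem's per-attempt success rate p ~ Beta(alpha, beta), then
    # pass@k = 1 - E[(1 - p)^k] = 1 - B(alpha, beta + k) / B(alpha, beta).
    return 1.0 - exp(log_beta(alpha, beta + k) - log_beta(alpha, beta))

# Sanity check: with a uniform Beta(1, 1) prior, pass@k = k / (k + 1).
print(round(pass_at_k(1.0, 1.0, 9), 6))  # 0.9
```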
R1: Comparison with inexact methods. Aligning with prior exact papers [10, 18], we focus on comparisons with exact methods.

We thank all five reviewers for their detailed and incisive feedback. We tested AustereMH [16], an inexact method, on robust linear regression in Section 5.1 with ... We added this to the Appendix. This does not affect the properties of TunaMH. Our theorem doesn't have this assumption; it suggests that for MHSubLhd with given user-specified ... The impact is 3-fold: it (1) provides an upper bound on performance for algorithms of Algorithm 1's form (TunaMH); (3) suggests directions for developing new algorithms. To be significantly faster than TunaMH, we either need more assumptions about the problem or new stateful algorithms.
Long Is More Important Than Difficult for Training Reasoning Models
Shen, Si, Huang, Fei, Zhao, Zhixiao, Liu, Chang, Zheng, Tiansheng, Zhu, Danhao
Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that the synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples: 95.6\% accuracy on MATH and 71.1\% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced, available at https://huggingface.co/ZTss/LONG1.
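A log-linear scaling law of the kind described means accuracy grows as a + b·ln(length), so each doubling of reasoning length adds a fixed number of accuracy points (b·ln 2). A minimal least-squares fit illustrates this; the (length, accuracy) pairs below are invented purely for the illustration, not taken from the paper.

```python
import math

# Hypothetical (reasoning length, accuracy) pairs showing a log-linear trend.
data = [(1000, 40.0), (2000, 46.0), (4000, 52.5), (8000, 58.0)]

# Ordinary least squares for accuracy = a + b * ln(length).
xs = [math.log(length) for length, _ in data]
ys = [acc for _, acc in data]
n = len(data)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)
a = ybar - b * xbar

# Under this fit, each doubling of length adds b * ln(2) accuracy points.
print(round(b * math.log(2), 2))  # 6.05
```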
Reviews: Generative Modeling by Estimating Gradients of the Data Distribution
The paper proposes to perform Langevin dynamics in data space (as opposed to the latent space) of a deep generative model as a means to explore the data distribution. This reduces the difficult problem of estimating the data distribution to the slightly less difficult problem of estimating its gradients. The latter are estimated by different versions of score matching. This paper mainly builds on recent work on score matching by random projections. As a result, a new generative model is achieved whose sample quality is similar to GANs, while avoiding an adversarial training paradigm. This is a strong contribution.
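The Langevin dynamics the review refers to follows the update x ← x + (ε/2)·∇log p(x) + √ε·z with z ~ N(0, 1). The toy sketch below uses a Gaussian whose score is known in closed form as a stand-in for the learned score network; step size, burn-in, and target parameters are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

# Target density p = N(mu, sigma^2); its score is known in closed form here,
# standing in for a learned score model.
mu, sigma = 3.0, 1.0

def score(x):
    # grad_x log p(x) for a Gaussian target
    return -(x - mu) / sigma**2

# Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z.
eps = 0.01
x = 0.0
samples = []
for step in range(20000):
    x += 0.5 * eps * score(x) + math.sqrt(eps) * random.gauss(0.0, 1.0)
    if step >= 2000:  # discard burn-in
        samples.append(x)

# The sample mean should land close to mu = 3.0.
mean = sum(samples) / len(samples)
```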
CppFlow: Generative Inverse Kinematics for Efficient and Robust Cartesian Path Planning
Morgan, Jeremy, Millard, David, Sukhatme, Gaurav S.
In this work we present CppFlow - a novel and performant planner for the Cartesian Path Planning problem, which finds valid trajectories up to 129x faster than current methods, while also succeeding on more difficult problems where others fail. At the core of the proposed algorithm is the use of a learned, generative Inverse Kinematics solver, which is able to efficiently produce promising entire candidate solution trajectories on the GPU. Precise, valid solutions are then found through classical approaches such as differentiable programming, global search, and optimization. In combining approaches from these two paradigms we get the best of both worlds - efficient approximate solutions from generative AI which are made exact using the guarantees of traditional planning and optimization. We evaluate our system against other state-of-the-art methods on a set of established baselines as well as new ones introduced in this work, and find that our method significantly outperforms others in terms of the time to find a valid solution and planning success rate, and performs comparably in terms of trajectory length over time. The work is made open source and available for use upon acceptance.
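The generate-then-refine pattern the abstract describes — cheap approximate candidates from a generative stage, made exact by classical optimization — can be shown on a toy 1-D problem. This is not CppFlow's actual pipeline; the objective, candidate count, and step size are all illustrative stand-ins.

```python
import random

random.seed(0)

# Toy objective standing in for a trajectory cost; its minimum is at x = 2.0.
def cost(x):
    return (x - 2.0) ** 2

# Stage 1 (generative stand-in): cheaply sample many rough candidates.
candidates = [random.uniform(-10.0, 10.0) for _ in range(64)]
seed_x = min(candidates, key=cost)

# Stage 2 (classical refinement): gradient descent from the best candidate.
x = seed_x
for _ in range(200):
    grad = 2.0 * (x - 2.0)  # derivative of cost
    x -= 0.1 * grad

# x is now essentially at the optimum, x ~ 2.0.
```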
Accelerating laboratory automation through robot skill learning
Materials discovery plays a pivotal role in addressing global challenges. The applications of new materials range from clean energy storage, to sustainable polymers and packaging for consumer products in a more circular economy, to drugs and therapeutics. During the COVID-19 pandemic, scientists had to halt experiments due to stringent social distancing measures, or to accelerate their efforts towards quickly producing a vaccine; partly as a result, there has recently been increased interest in using robotics and automation in laboratory environments. The challenge is that laboratories have been designed by and for humans, so the available glassware, tools and equipment pose difficult problems for traditional automation methods, which are inherently open loop and not adaptable. Learning-based methods that rely on autonomous trial and error are increasingly being used to achieve robotic tasks that could not previously be addressed with automation.
A Gentle Introduction to Bayesian Inference
In this article, we have seen the Bayesian approach in action with the help of a small example. It combines prior knowledge with observed data to form a posterior, much as humans intuitively update their beliefs. This is clearly better than discarding the data and proceeding with the prior alone. It also subsumes the maximum likelihood method: choose a flat prior, i.e. one that assigns the same probability (or density) to every possible value of θ and is essentially a constant, and the posterior mode coincides with the maximum likelihood estimate. Furthermore, the Bayesian method gives you an entire distribution over the parameters, while the maximum likelihood method yields only a point estimate.
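The prior-to-posterior update described above can be made concrete with a coin-flip example using the conjugate Beta-binomial pair (all numbers below are arbitrary illustration values, not from the article):

```python
# Conjugate Beta-binomial update for a coin's heads probability theta.

a, b = 2, 2            # Beta(2, 2) prior: 2 pseudo-heads, 2 pseudo-tails
heads, tails = 7, 3    # observed data

# The posterior is Beta(a + heads, b + tails): the data simply add to the
# prior pseudo-counts.
post_a, post_b = a + heads, b + tails

posterior_mean = post_a / (post_a + post_b)  # 9 / 14
mle = heads / (heads + tails)                # 7 / 10, ignores the prior

print(round(posterior_mean, 3))  # 0.643
print(mle)                       # 0.7
```

Note how the prior pulls the posterior mean (0.643) slightly toward 0.5 compared with the maximum likelihood estimate (0.7), and how the posterior is a full distribution, not just this single number.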
Why Speech Separation is Such a Difficult Problem to Solve
You are talking on the phone, recording audio, or speaking to a voice assistant like Google Assistant, Cortana, or Alexa. But the person on the other side of the call cannot hear you because you are in a crowded place, the recording has a lot of background noise, or the "Hey, Alexa" wake phrase wasn't picked up by your device because someone else started speaking. All of these problems relate to separating voices, informally referred to as the "cocktail party problem", and they have been addressed using artificial intelligence and deep learning methods in recent years. Still, separating and inferring multiple simultaneous voices is a difficult problem to solve completely. To start with a definition: speech separation is the task of extracting the speech of the "wanted speaker" or "speaker of interest" from an overlapping mixture of speech from other speakers, which is also referred to as "noise".